Integrating Data and Probabilistically Structured Text Documents
Abstract
Commercial, non-profit and public organizations are accumulating huge amounts of electronically available text documents. Although composed of unstructured text, the documents in archives such as annual reports to shareholders, medical patient records and public announcements often share an inherent, though undocumented, structure. To enable the integration of such text collections with related structured data sources, this inherent structure should be made explicit in as much detail as possible. The goal of this study is to establish a methodology for integrating text documents with structured records into a hyper-archive of application-specific entities. The implicit structure of the text documents is made explicit by data mining techniques, as proposed in the DIAsDEM framework for semantic tagging of domain-specific text documents. The result is a probabilistic DTD that serves as a basis for matching both schemata and data instances.
Related resources
Text Analytics to Data Warehousing
Information hidden or stored in unstructured data can play a critical role in making decisions, understanding and conducting other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the extent of development happening in Text Mining and technologies to deal with unstructured and semi-structured data like X...
Integrating a Structured-Text Retrieval System with an Object-Oriented Database System
We describe the integration of a structured-text retrieval system (TextMachine) into an object-oriented database system (OpenODB). Our approach is a light-weight one, using the external function capability of the database system to encapsulate the text retrieval system as an external information source. Yet, we are able to provide a tight integration in the query language and processing; the us...
Exploiting Evidence from Unstructured Data to Enhance Master Data Management
Master data management (MDM) integrates data from multiple structured data sources and builds a consolidated 360-degree view of business entities such as customers and products. Today’s MDM systems are not prepared to integrate information from unstructured data sources, such as news reports, emails, call-center transcripts, and chat logs. However, those unstructured data sources may contain val...
Learning to Classify Text from Labeled and Unlabeled Documents
In many important text classification problems, acquiring class labels for training documents is costly, while gathering large quantities of unlabeled data is cheap. This paper shows that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents. We present a theoretical argume...
Using EM to Classify Text from Labeled and Unlabeled Documents
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is significant because in many important text classification problems obtaining classification labels is expensive, while large quantities of unlabeled documents are readily available. We present a theoretical ar...
Publication date: 2001